Abstract

The issue of partitioning a network into communities has attracted a great deal of atten- 
tion recently. Most authors seem to equate this issue with the one of finding the maximum 
value of the modularity, as defined by Newman. Since the problem formulated this way is 
NP-hard, most effort has gone into the construction of search algorithms, and less to the 
question of other measures of community structures, similarities between various partition- 
ings and the validation with respect to external information. 

Here we concentrate on a class of computer-generated networks and on three well-studied
real networks which constitute a benchmark for network studies: the karate club, the US
college football teams and a gene network of yeast. We utilize some standard ways of
clustering data (originally not designed for finding community structures in networks) and 
show that these classical methods sometimes outperform the newer ones. We discuss var- 
ious measures of the strength of the modular structure, and show by examples features 
and drawbacks. Further, we compare different partitions by applying some graph-theoretic 
concepts of distance, which indicate that one of the quality measures of the degree of
modularity corresponds quite well with the distance from the true partition. Finally, we
introduce a way to validate the partitionings with respect to external data when the nodes
are classified but the network structure is unknown. This is possible here since the structure
of the computer-generated networks is fully known, as is the historical answer to how the
karate club and the football teams were partitioned in reality. The partitioning of the gene
network is validated using the Gene Ontology database, where we show that a community in
general corresponds to a biological process.
1 Introduction



Complex networks, i.e., assemblies of nodes and edges with nontrivial properties,
can be used to describe systems in many different fields, such as sociological (scientific
collaborations and structure of organizations), biological (protein and gene
interactions) and technological (Internet and the web). These systems are composed
of a large number of interacting agents, and the complexity originates partly from
the heterogeneity in their interaction patterns. Given this high degree of complexity,
it is often necessary to divide a network into different subgroups to facilitate the
understanding of the relationship among different components [8,31,23]. Outlines 
of recent work are given in, e.g., [6,4] together with broad discussions of relevant 
literature. 

In recent years there has been an increasing interest in the properties of networks, 
and the property of community structure has attracted great attention. The vertices 
in the networks are often found to cluster into tightly-knit groups with a high den- 
sity of within-group edges and a lower density of between-group edges. Clustering 
techniques have acquired a dominant role among the tools used to decompose the 
network into functional units. Community structure is a topological property of net- 
works and it is linked to the concept of classification of objects in categories. The 
working definition of community is general, but ambiguous. There is no generally 
accepted formal definition, but an informal one is "a subset of nodes within the 
graph such that connections between the nodes are denser than connections with 
the rest of the network" (see Fig. 1). That is, a community can be seen, depending on
the context, as a class, group, cluster, etc. Communities in a social network might
represent real social groupings, perhaps by interest or background; communities in
a citation network might represent related persons on a single topic; communities
in a gene network might represent specific biological processes; communities on
the web might represent pages on related topics; and so on. Being able to identify
these communities could help us to understand and exploit these networks more
efficiently.

Fig. 1. Example of a small network with community structure.

Since the issue of community detection has acquired relevance in several fields,
great interest has arisen in the algorithms used to determine communities in a
network. Several such algorithms are known in the literature.
A survey of the different approaches can be found in [21,20], and especially in [4] 
where these approaches are evaluated with respect to computational cost and sensitivity.
In the present paper, we return to two classical methods of finding groups
among data: hierarchical clustering and K-means. Both of these methods
are well known within the applied statistics community [26], but at least the former
has in some papers been discarded as a less suitable way of finding communities in
a network [8,20]. One of our results is that this rejection was a bit premature: with
a suitable metric for the distances between the nodes, the result of such a hierarchical
clustering can outperform some of the more modern approaches. K-means
has not, to the best of our knowledge, been utilized before for finding community
structures in complex networks.

The present paper focuses on networks with a single type of vertex and a single type 
of undirected, unweighted edges. As examples, we consider simulated networks 
(originally described in [8]) and three real examples from the literature: Zachary's
karate club [30], the college football teams in US division one for the year 2000
[8] and a gene network of the yeast Saccharomyces cerevisiae [17]. By using different
quality measures, we explore how well the two methods we study perform,
and by using distance measures for partitions we also investigate how similar the
communities found by different algorithms are. Finally, we introduce an external
validation method which also works for only partially known networks. The article
is written to be as self-contained as possible, which means that we also introduce,
rather carefully, concepts that are well described elsewhere in the literature. Since
several of the sources are scattered among different scientific areas,
we consider it meaningful to repeat them here. References are always
given, though, when the concepts are not new.

The article is organized so that new concepts are introduced in direct connection
with the networks to which they are applied for the first time.
This takes the following form:

• Section 2 introduces the tools we utilize for finding community structures in the
  networks, particularly
    ■ metrics for distances between the nodes.
    ■ clustering algorithms (hierarchical clustering and K-means).

• Section 3, the karate club network, with the concepts
    ■ modularity.
    ■ Silhouette index.
    ■ a null hypothesis by rewiring.
    ■ measures and indices of the similarity of two partitionings.

• Section 4, computer generated networks, with illustrations of most of the concepts
  introduced thus far.

• Section 5, US college football teams, with
    ■ a novel coherence score, measuring how well the detected modules correspond
      to an external classification of the units of the network.

• Section 6, a gene network from yeast. This time, no "true" partitioning exists,
  and we apply the different methods introduced earlier in this article in order to
  see whether we can obtain a biologically meaningful division of the network.

• Section 7, a general discussion of the results obtained and conclusion of the paper.



2 Detecting community structures 

This section introduces the algorithms we utilize for detecting communities. It com- 
prises several ideas scattered through the literature, and its purpose is to make the 
article self-contained. However, the reader is still referred to the references for more 
detailed descriptions of the central concepts. 

2.1 Distance between nodes

First we need a way to measure distance between nodes in the networks. The most 
common way is to consider the geodesic, that is, the shortest path (counted in num- 
ber of links) connecting two vertices. The geodesic distance between two nodes is 
then just the minimum number of links which separate them. For future reference 
we denote the matrix of all pairwise geodesic distances as G.

In [8] Girvan and Newman propose to define the distance between vertices as the total
number of paths that run between them. However, the number of paths between
any two vertices is infinite (unless it is zero), so paths of length $\ell$ are weighted with
a factor $\alpha^{\ell}$ with $\alpha$ small, so that the weighted count of the number of paths
converges. In this way, long paths contribute with less weight than those that are short.
If $A$ is the adjacency matrix of the network, such that $A_{ij}$ is unity if there is an edge
between vertices $i$ and $j$ and zero otherwise, then the distances are given by the
elements of the matrix $W$, calculated as

$$ W = \sum_{\ell=0}^{\infty} (\alpha A)^{\ell} = (I - \alpha A)^{-1}. \qquad (1) $$

For the sum to converge, $\alpha$ must be chosen smaller than the reciprocal of the largest
eigenvalue of $A$. Both these definitions of distance give reasonable results for community
structures, but in some cases they are less successful. Different authors have
used different approaches for the choice of the weights, but to the best
knowledge of the present authors no systematic comparison has been made between
clustering algorithms implemented with different choices of distances.
Both measures result in a matrix (G or W) of dimension n × n (where n denotes
the number of nodes in the network) whose elements are the distances between the
nodes in the network.
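As an illustration, the two matrices can be computed along the following lines. This is a minimal sketch in plain Python, not the implementation used for the results in this paper: the function names `geodesic_matrix` and `path_sum_matrix`, the toy network and the value of alpha are assumptions made for the example, and the infinite sum is simply truncated after a fixed number of terms.

```python
from collections import deque

def geodesic_matrix(adj):
    """All-pairs shortest path lengths (in links) by breadth-first search."""
    n = len(adj)
    G = [[float("inf")] * n for _ in range(n)]
    for s in range(n):
        G[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and G[s][v] == float("inf"):
                    G[s][v] = G[s][u] + 1
                    queue.append(v)
    return G

def path_sum_matrix(adj, alpha, terms=200):
    """Truncated series sum_l (alpha*A)^l, approximating (I - alpha*A)^-1.

    alpha must be smaller than 1 / (largest eigenvalue of A) to converge.
    """
    n = len(adj)
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # (alpha*A)^0
    W = [row[:] for row in term]
    for _ in range(terms):
        # Next power: term <- alpha * term * A
        term = [[alpha * sum(term[i][k] * adj[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                W[i][j] += term[i][j]
    return W

# Hypothetical toy network: a path 0 - 1 - 2 (largest eigenvalue sqrt(2)).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
G = geodesic_matrix(A)
W = path_sum_matrix(A, alpha=0.2)
```

For larger networks a direct matrix inversion from a linear algebra library would normally replace the truncated series.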

Finally, we utilize as distance measure between nodes i and j the Euclidean distance
between the ith and jth rows in one of the matrices (G or W) above. That is,
we take

$$ \mathrm{dist}(\text{node } i, \text{node } j) = \sqrt{\sum_{k} (P_{ik} - P_{jk})^2}, \qquad (2) $$

where P = G for the geodesic distance and P = W for the sum of all paths.
This way, the issue of finding communities in a network becomes algorithmically
identical to finding clusters from co-variation within a series of experiments, and
standard routines available in, e.g., MATLAB, R, etc. can be used. A similar approach
with respect to (2) and hierarchical clustering, but with a different P, has been 
suggested in [24] and applied to a protein-protein interactions network. 
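The row-wise Euclidean distance of Eq. (2) can be sketched as follows. The function name `row_distance_matrix` and the toy input are assumptions for illustration, and the sum is taken over all columns k, which is one reading of "distance between the ith and jth row".

```python
import math

def row_distance_matrix(P):
    """Euclidean distance between every pair of rows of P."""
    n = len(P)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((P[i][k] - P[j][k]) ** 2
                              for k in range(len(P[i]))))
            D[i][j] = D[j][i] = d
    return D

# Geodesic matrix of the path 0 - 1 - 2 (hypothetical toy input).
G = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
D = row_distance_matrix(G)
```

The resulting matrix D is what a standard clustering routine would receive as input.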



2.2 Community detecting algorithms 



The following two standard algorithms have been implemented and tested: 

• Hierarchical clustering. 

• K-means algorithm. 

Of course, our choice of tested algorithms is not intended to be exhaustive, but we 
have focused on some algorithms that are well-known clustering methods, but have 
not been applied systematically to complex networks. 

Hierarchical clustering [7] is an unsupervised procedure of transforming a distance 
matrix, which is a result of pair-wise similarity measurement between elements 
of a group, into a hierarchy of nested partitions. It is an agglomerative procedure, 
which means that it starts with as many clusters as there are nodes, i.e., each node 
forms a cluster containing only itself. Iteratively the number of clusters is reduced 
by a merging of the two most similar clusters until only one cluster remains. Once 
several nodes have been linked together, a linkage rule is needed to determine if 
two clusters are sufficiently similar to be linked together. Different linkage rules 
have been proposed. In this paper complete linkage, also called furthest neighbour,
is used. This method utilizes the largest distance between nodes in two groups.
Explicitly, it takes the form

$$ d(r, s) = \max_{i = 1, \dots, n_r;\; j = 1, \dots, n_s} \mathrm{dist}(x_{ri}, x_{sj}), \qquad (3) $$

where $n_r$ and $n_s$ are the numbers of nodes in clusters r and s, respectively, and $x_{ri}$
denotes the ith node in cluster r.
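The agglomerative procedure with complete linkage can be sketched in a few lines. This is a deliberately simple, unoptimised illustration rather than the routine used for the results reported here; the function name `complete_linkage` and the toy distance matrix are assumptions.

```python
def complete_linkage(D, n_clusters):
    """Agglomerative clustering with complete linkage on a distance matrix D.

    Starts from singletons and repeatedly merges the two clusters whose
    largest pairwise node distance is smallest.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                # Complete linkage: largest distance between the two groups.
                d_rs = max(D[i][j] for i in clusters[r] for j in clusters[s])
                if best is None or d_rs < best[0]:
                    best = (d_rs, r, s)
        _, r, s = best
        clusters[r] = clusters[r] + clusters[s]
        del clusters[s]
    return [sorted(c) for c in clusters]

# Hypothetical distances: nodes 0 and 1 are close, nodes 2 and 3 are close.
D = [[0, 1, 6, 7],
     [1, 0, 5, 6],
     [6, 5, 0, 1],
     [7, 6, 1, 0]]
partition = complete_linkage(D, 2)
```

Stopping the merging at every value of `n_clusters` from n down to 1 yields the full hierarchy of nested partitions.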

K-means [11] is a clustering algorithm that is widely used when working with temporal
data where the elements should be grouped on the basis of their time profile.
Data are clustered into N mutually exclusive clusters, where N has to be chosen
beforehand. It uses an iterative algorithm that minimises the sum of distances from
each object to its cluster centroid, over all clusters. The algorithm moves the objects
in a deterministic fashion between the clusters until the sum cannot be decreased
further. The result is a set of clusters that is as compact and well separated as possible,
given the initial partitioning. A drawback of the K-means algorithm is that it
often converges to a local optimum. This problem can be somewhat remedied by
choosing multiple starting points. To the best of our knowledge, there have not yet been
systematic studies of the use of the K-means algorithm to detect community structure
in networks.
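A minimal sketch of K-means with multiple starting points is given below. It assumes the usual squared Euclidean distance to the centroid and random initial centroids drawn from the data; both choices, as well as the function name `kmeans` and its parameters, are assumptions for illustration and not a description of the routine used in this paper. Each point would be a row of G or W.

```python
import random

def kmeans(points, k, restarts=10, seed=0, max_iter=100):
    """Lloyd-style K-means, keeping the best of several random starts."""
    rng = random.Random(seed)
    dim = len(points[0])

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best_cost, best_labels = float("inf"), None
    for _ in range(restarts):
        centroids = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _ in range(max_iter):
            # Assign every point to its nearest centroid.
            new_labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                          for p in points]
            if new_labels == labels:
                break  # converged (possibly to a local optimum)
            labels = new_labels
            # Move every centroid to the mean of its members.
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:  # keep the old centroid if a cluster empties
                    centroids[c] = [sum(x[d] for x in members) / len(members)
                                    for d in range(dim)]
        cost = sum(sqdist(p, centroids[l]) for p, l in zip(points, labels))
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost

# Hypothetical data: two well-separated pairs of points.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, cost = kmeans(points, 2)
```

Keeping only the lowest-cost run among the restarts is one simple way of reducing the sensitivity to the initial partitioning.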



3 Zachary's karate club 

Zachary observed 34 members of a karate club over a period of 2 years. During 
the course of the study, a disagreement developed between the administrator of the 
club and the club's instructor, which ultimately resulted in the instructor leaving 
and starting a new club taking about half of the original club's members with him. 

Zachary constructed a network of friendship between members of the club, using 
a variety of measures to estimate the strength of ties between individuals. Here we 
use a simple unweighted version of his network in an attempt to identify the
factions involved in the split of the club (see [8], [30]). The network¹ consists of
34 nodes and 78 links as illustrated in Fig. 2; squares denote the supporters of the 
trainer (node 1) and circles represent the supporters of the administrator (node 34). 

When we apply the community detecting algorithms described in Sec. 2 above, we 
also need a way of deciding upon the number of communities we should have. One 
way of getting this number, when we do not have prior knowledge, is to measure 
the quality of the partitioning itself and simply pick the number which results in the 
"best" clustering. However, in accordance with the ambiguity mentioned in the in- 
troduction about what is meant by community structure, there exist several different 
measures for the quality of the partitioning of a set of nodes into communities. Here 
we will discuss two such measures, the Silhouette index [2] and the modularity [8]. 



¹ The network can be downloaded from http://vlado.fmf.uni-lj.si/pub/networks/data/UciNet/
and the graphical representation of the network is obtained from
http://www-personal.umich.edu/~mejn/networks/; see [8].

Fig. 2. Zachary's friendship network of the karate club.



3.1 Evaluation of the strength of the community structure, modularity 

The first measure we consider is Newman's modularity [19], which has become a 
kind of de facto standard for measuring the quality of a partitioning (although there 
are alternatives coming, e.g., [18,15]). 

The modularity is defined in the following way: Given a particular division of a network
into k communities, let e denote a k × k matrix whose element $e_{ij}$ is the
fraction of all edges in the network which connect vertices in community i to those
in community j. The trace of this matrix, $\mathrm{Tr}\, e = \sum_i e_{ii}$, gives the fraction of edges
in the network that connect vertices in the same community. A good community
division should have a high value of the trace, but the trace on its own is not a good
indicator of the quality of the division since, for example, placing all vertices in a
single community would give the maximal value of $\mathrm{Tr}\, e = 1$ without giving any
information about the community structure. The row (column) sum is then defined
as $a_i = \sum_j e_{ij}$, and it represents the fraction of edges that connect to vertices in
community i. In a network where edges fall between vertices without regard for
the communities they belong to, it holds that $e_{ij} = a_i a_j$. The modularity is therefore
defined as

$$ Q = \sum_i \left( e_{ii} - a_i^2 \right). \qquad (4) $$

It measures the fraction of edges within communities, minus the expected value of the
same quantity in a network with the same community divisions but random connections
between the nodes. If a particular division gives no more within-community
edges than would be expected by random chance, the modularity is zero. Values
other than zero indicate deviations from randomness, and as a rule of thumb, values
above 0.3 indicate a modular structure [22]. In practice, values above 0.7 are rare,
and indicate a very clear structure. However, Erdős-Rényi (ER) random graphs
can also possess a very high modularity, as shown in [9]. The reason is that there are so
many different ways to partition a network that it is likely that there is at
least one partition where the density of links within a cluster greatly exceeds
the one obtained by chance. Because of this, we consider below in Section 3.3 a
null hypothesis where the networks are rewired.
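To make the definition concrete, the modularity of a given partition can be computed directly from an edge list as the sum over communities of the within-community edge fraction minus the squared fraction of edge ends attached to that community. The sketch below is an illustration only; the function name `modularity` and the toy network (two triangles joined by one link) are assumptions.

```python
def modularity(edges, community):
    """Modularity Q = sum_i (e_ii - a_i^2) for an undirected, unweighted network.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    e_in = {}   # e_ii: fraction of edges inside community i
    a = {}      # a_i: fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e_in.get(c, 0.0) - a_c ** 2 for c, a_c in a.items())

# Hypothetical toy network: two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}
Q_two = modularity(edges, two_groups)   # 2 * (3/7 - 1/4) = 5/14
Q_one = modularity(edges, one_group)    # trace is 1, but Q is 0
```

The second call shows why the trace alone is not a useful quality measure: a single all-embracing community has trace one but zero modularity.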
We apply the K-means and the hierarchical clustering methods to the Karate club,
both when the distances are geodesic and when they are a sum over all paths.
By calculating the modularity for every possible number of communities, from one up
to 34, we can easily see what the "best" division is. For each partition the modularity
Q is computed and plotted in Fig. 3 (for K-means, we took the division
among the repetitions with the same number of communities resulting in the highest
Q, and discarded the rest). We can immediately see that the hierarchical algorithm
applied with the distance given by the sum of all the paths (1) always gives negative
values of Q; this means that the partitions obtained do not correspond to communities
of the network. The other algorithms, in contrast, can all partition the network into
communities. For each algorithm we have analysed the partition corresponding to
the highest modularity Q, and the results are presented in Table 1.²

Fig. 3. Modularity Q for the communities detected in Zachary's karate club. Solid curve:
hierarchical clustering (shortest path), dashed curve: K-means clustering (sum of paths),
dotted curve: K-means clustering (shortest path), dash-dotted curve: hierarchical clustering
(sum of paths).

The K-means algorithm used with the sum-of-paths distance between nodes detects
two communities and all the nodes are partitioned according to the sociological
division that took place in the club (circles and squares of Fig. 2 are divided into
two separate groups). When the K-means algorithm and hierarchical clustering
are used with the distance given by the shortest path, they both detect 4 communities.
Looking in detail, we see that in both cases the 4 communities represent a further
division of the 2 factions created in the club. This means that the union of communities
1 and 2 is exactly the group of supporters of the trainer and the union of
communities 3 and 4 contains all the supporters of the administrator.

² We note that our peak value, Q = 0.4198, slightly exceeds the maximum we have found
in the literature, Q = 0.4188, in [5].

Table 1
Karate club: highest modularities obtained using different approaches and their corresponding
community structure, computed for the true network (Q) and for the rewired network
representing the null hypothesis ($Q_H$).

Algorithm                      Q      No. of clusters   $Q_H$          No. of clusters
Hierarchical (shortest path)   0.42   4                 0.22 ± 0.04    9
K-means (sum of paths)         0.37   2                 0.15 ± 0.02    9
K-means (shortest path)        0.34   4                 0.18 ± 0.07    3



3.2 Evaluation of the strength of the community structure, Silhouette index 

The second measure we discuss is the Silhouette index [25,2,3]. This measure is 
wide-spread in the context of clustering based on co-variation over several exper- 
iments, and since we explore such methods here, a discussion of the measure is 
most appropriate. 

For each cluster, one can calculate the Silhouette index, $S_j$, which characterizes the
heterogeneity and isolation properties of the cluster. The Global Silhouette index,
GS, is the mean of all the Silhouette indices (one for each cluster) for the set, and
can be used as an effective validity index.

A drawback of the Silhouette index is that a community consisting of only one node 
is considered to be a perfect partitioning, i.e., the confidence indicator becomes 
unity. Thus, this measure will be inclined towards such clusters, which makes it a 
less suitable candidate for measuring the quality of a partitioning. This effect can 
be eliminated by modifying the average by discarding all the terms in the sum cor- 
responding to clusters with only one element. This is what has been implemented 
here. 
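The modified average can be sketched as follows. The per-node silhouette used here is the standard one, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from node i to the other nodes of its own cluster and b(i) the smallest mean distance to any other cluster; that this matches the exact variant used in [25,2,3] is an assumption, as are the function name `global_silhouette` and the toy distance matrix.

```python
def global_silhouette(D, clusters):
    """Global Silhouette index from a distance matrix D and a list of clusters.

    Singleton clusters are left out of the average. Needs at least two
    clusters, of which at least one has more than one node.
    """
    cluster_scores = []
    for c, members in enumerate(clusters):
        if len(members) < 2:
            continue  # a lone node would otherwise count as a perfect cluster
        scores = []
        for i in members:
            # a: mean distance to the rest of the node's own cluster
            a = sum(D[i][j] for j in members if j != i) / (len(members) - 1)
            # b: mean distance to the nearest other cluster
            b = min(sum(D[i][j] for j in other) / len(other)
                    for k, other in enumerate(clusters) if k != c)
            scores.append((b - a) / max(a, b))
        cluster_scores.append(sum(scores) / len(scores))
    return sum(cluster_scores) / len(cluster_scores)

# Hypothetical distances: nodes 0 and 1 are close, nodes 2 and 3 are close.
D = [[0, 1, 6, 7],
     [1, 0, 5, 6],
     [6, 5, 0, 1],
     [7, 6, 1, 0]]
gs_pairs = global_silhouette(D, [[0, 1], [2, 3]])
gs_with_singletons = global_silhouette(D, [[0, 1], [2], [3]])
```

In the second call only the cluster {0, 1} contributes to the average, since the two singletons are discarded.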

This index has been computed for all the partitions detected by the hierarchical
algorithm with both the shortest-path distance and the sum-of-paths distance between
nodes, i.e., for one partitioning which works reasonably well and for one which
gives nonsensical results for the modularity. The results are shown in Fig. 4. The
figure shows that in both cases the Silhouette index has its maximum value for
N = 2, that is, when the set is divided into two clusters. Checking the corresponding
partitions, we see that when hierarchical clustering (shortest path) is used,
the maximum of the Global Silhouette index really corresponds to the real structure
of the set. On the other hand, the partition corresponding to two communities detected
by hierarchical clustering (sum of paths) has one cluster containing only one
node (node 34: the administrator) and all the other nodes are grouped together. This
partition is clearly quite far from the original one, while the Global Silhouette index
still has a numerical value that in the literature is considered to indicate a community
structure.

Fig. 4. Global silhouette index GS (solid curve) for the communities detected in Zachary's
karate club with hierarchical clustering (left: shortest path, right: sum of paths). The
modularities Q (dashed curve) are also shown for comparison.

The Global Silhouette index is obviously less suited for estimating the quality of 
a network partitioning, and in the remainder of the present paper we will only utilize
the modularity for measuring the degree of community structure. Further, since the 
algorithm based on hierarchical clustering (sum of paths) does not seem to work, 
we will also exclude that one from all further considerations. 

3.3 Rewiring 

The results obtained are here tested with respect to the null hypothesis that the
community structure found is solely due to the degree distribution. This testing is
necessary, since it has been shown that random networks can also possess a high
modularity [9]. Each network is rewired according to the algorithm presented in
[14]. This algorithm randomly rewires a network while preserving the degree of
each node. In this way, we investigate whether the communities detected by the algorithms
might have arisen from the degree distribution alone, or whether they represent an intrinsic
structure of the network beyond the most obvious one. The three different clustering
schemes (hierarchical clustering (sum of paths) is discarded) are applied to the
rewired network and the modularity Q is computed. This procedure is repeated 100
times, and the results are shown in Table 1. For each clustering procedure the mean
of the Q value is computed over all the repetitions. The Q value is reported together
with the corresponding standard deviation. The values of the modularity obtained
for the rewired networks are much smaller than those for the true network; this
shows that the partitions detected in the network do not arise from the degree
distribution alone.
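A degree-preserving randomisation can be sketched with repeated edge swaps. The scheme below is a common way of doing this and is given as an illustration; whether it coincides in every detail with the algorithm of [14] is an assumption, as are the function name `rewire` and its parameters.

```python
import random

def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomisation by repeated edge swaps.

    Picks two edges (a, b) and (c, d) and replaces them with (a, d) and (c, b),
    skipping swaps that would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: swap would give a self-loop or no change
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degrees(edge_list):
    """Degree of every node appearing in an edge list."""
    deg = {}
    for u, v in edge_list:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Hypothetical toy network: a ring of 8 nodes.
ring = [(i, (i + 1) % 8) for i in range(8)]
rewired = rewire(ring, n_swaps=20)
```

Every node keeps its degree after each swap, so any community structure that survives the rewiring can be attributed to the degree sequence alone.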



3.4 Distance between partitions 

A less studied property, at least in the recent physics literature on networks, is how 
to compare two different partitionings of the same network with each other. Dif- 
ferent definitions of distance between partitions can be found in the literature (see, 
e.g., [16] for an early comparison of different measures), and here we discuss two of 
these. In the next subsection, we will compare these measures with two indices that 
have recently been advocated within this context. For both measures, we need the 
concept of the meet of two partitionings. Let A and B denote two partitionings of the
whole set, with corresponding elements $a_i$, $i = 1, \dots, |A|$, and $b_j$, $j = 1, \dots, |B|$,
which are sets themselves. The meet is then the set C given by

$$ C = \bigcup_{i=1}^{|A|} \bigcup_{j=1}^{|B|} \{ a_i \cap b_j \}, \qquad (5) $$

where empty intersections are discarded.

The two distance measures we discuss here are:

$m_{\mathrm{moved}}$: The distance between the partitioning A and the meet C, given the partitioning
B, is defined as the minimum number of elements that must be moved between
the clusters so that A and C become identical [10]. The distance between A and
B, also denoted as the total distance, is then obtained as the sum of the distance
between A and C and the distance between B and C. (Alternative, but equivalent,
definitions can be found in [10,29].)
$m_{\mathrm{div}}$: The distance between the partitioning A and the meet C, given the partitioning
B, is defined as the minimum number of divisions that must be implemented in
A so that A and C become identical [27]. The distance between A and B, the
total distance, is then obtained in the same way as for $m_{\mathrm{moved}}$, i.e., as the sum of
the two partial distances.

These definitions give different results when applied to general partitions. In order
to understand how two partitions are related to each other, both measures are useful,
as shown by the example below.

Example 1 (Distances between two partitions)

Let n = 9, and let A and B be the two partitions A = {{1, 2, 3, 4, 5, 6}, {7, 8, 9}}
and B = {{1, 2, 4, 5, 7, 8}, {3, 6}, {9}}. The meet from (5) is then given by C =
{{1, 2, 4, 5}, {3, 6}, {7, 8}, {9}}. The measures computed according to the two methods
above are shown in Table 2. It shows that the two definitions of distance between
partitions may give different measures. The information provided by the two
methods is complementary: both the number of divisions that must be
performed and the number of elements that must be moved are relevant in order to find out
the relationship between the partitions. The distance from partition A to the
meet is 3 according to the first method, $m_{\mathrm{moved}}$, and this means that three elements
(elements 3, 6 and 9) must be moved, but we do not know how these three elements
are grouped in partition B. This information is given by the measure computed with
the second method, $m_{\mathrm{div}}$: only two divisions are necessary, and this implies that elements
3, 6 and 9 belong to two different clusters in partition B. When comparing
partition B with respect to A, the measure $m_{\mathrm{moved}}$ gives the distance from partition
B to the meet equal to two; in fact, elements 7 and 8 must be moved in order to
obtain two partitions with the same clusters. Furthermore, the measure $m_{\mathrm{div}}$ says
that these two elements belong to the same cluster in A, and therefore they can be seen
as a subcluster. If the distance computed by $m_{\mathrm{div}}$ is much smaller than the distance
given by $m_{\mathrm{moved}}$, it means that many subpartitions are present; elements belonging
to the same original partitions are grouped together.

Table 2
Distance between partitions A and B of Example 1, where distA (distB) denotes the distance
from A (B) to the meet and distAB the total distance between the partitions.

Method                 distA   distB   distAB
$m_{\mathrm{moved}}$   3       2       5
$m_{\mathrm{div}}$     2       1       3
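The numbers of Example 1 can be reproduced with a short sketch. It rests on one reading of the two definitions, stated here as an assumption: for the moved-elements distance, every cluster keeps its largest fragment and all other elements are moved, and for the divisions distance, a cluster split into k fragments costs k - 1 divisions. The function names are hypothetical.

```python
def meet(A, B):
    """All non-empty intersections of a cluster of A with a cluster of B."""
    return [a & b for a in A for b in B if a & b]

def m_moved_to_meet(A, B):
    """Elements to move out of A's clusters: all but the largest fragment of each."""
    return sum(len(a) - max(len(a & b) for b in B) for a in A)

def m_div_to_meet(A, B):
    """Divisions needed in A: a cluster split into k fragments costs k - 1."""
    return len(meet(A, B)) - len(A)

# The two partitions of Example 1.
A = [{1, 2, 3, 4, 5, 6}, {7, 8, 9}]
B = [{1, 2, 4, 5, 7, 8}, {3, 6}, {9}]
C = meet(A, B)

dist_moved = (m_moved_to_meet(A, B), m_moved_to_meet(B, A))
dist_div = (m_div_to_meet(A, B), m_div_to_meet(B, A))
total_moved = sum(dist_moved)
total_div = sum(dist_div)
```

A partial distance of zero from one partition to the meet indicates that this partition is a subpartition of the other.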



For the partitions of Table 1 for the Karate club, the distance from the original 
division of the set (the one represented in Fig. 2) has been computed using both 
definitions given above. The results are illustrated in Table 3. 

Table 3
Karate club: distance between the partitions with highest Q (P) and the meet, and the
original division ($P_0$) and the meet.

                                                 $m_{\mathrm{moved}}$       $m_{\mathrm{div}}$
Algorithm                      No. of clusters   distP   dist$P_0$          distP   dist$P_0$
Hierarchical (shortest path)   4                 0       11                 0       2
K-means (sum of paths)         2                 0       0                  0       0
K-means (shortest path)        4                 1       16                 1       3



Table 3 shows that the partition produced by hierarchical clustering (shortest path),
P, represents a subpartition of the original division, $P_0$; in fact the distance from
the obtained partition P to the meet is zero. The K-means algorithm (shortest path),
in contrast, misclassifies one element: this can be seen from the fact that the distance
$m_{\mathrm{moved}}$ from P to the meet is one, meaning that one element must be moved in
order for the partition P to agree with the original division. The K-means algorithm with
distance given by the sum of all paths clearly detects a partition that is identical to
the original division that occurred in the club.
3.5 Similarity between partitions



Two well-established indices for the similarity between two partitionings of a set 
are the Jaccard index and the mutual information index. They are defined as: 

$I_J$: The Jaccard index represents a measure of agreement between two partitions A
and B. It is defined as [13]

$$ I_J(A, B) = \frac{n_{11}}{n_{11} + n_{01} + n_{10}}, \qquad (6) $$

where $n_{11}$ denotes the number of pairs of elements that are simultaneously
joined together in partitions A and B, and $n_{10}$ ($n_{01}$) denotes the number of pairs of
elements that are joined (separated) in A and separated (joined) in B. It results
in a matching coefficient in the range [0, 1], where a value of 1 indicates that the
two partitions are identical. The Jaccard index was originally developed to assess
similarity among distributions of flora in different geographic areas [12].
$I_{\mathrm{NMI}}$: The normalized mutual information index is a measure of similarity between
partitions A and B [13,28]. It is based on the mutual information between the
partitions when the two partitions are treated as (nominal) random variables.
The normalized mutual information index can be expressed as

$$ I_{\mathrm{NMI}}(A, B) = \frac{-2 \sum_{i=1}^{|A|} \sum_{j=1}^{|B|} n_{ij}^{AB} \log\left( \frac{n_{ij}^{AB}\, n}{n_i^A\, n_j^B} \right)}{\sum_{i=1}^{|A|} n_i^A \log\left( \frac{n_i^A}{n} \right) + \sum_{j=1}^{|B|} n_j^B \log\left( \frac{n_j^B}{n} \right)}, \qquad (7) $$

where $n_i^A$ represents the number of units in cluster $a_i$, $n_{ij}^{AB}$ denotes the number
of shared elements between clusters $a_i$ and $b_j$, and n is the number of units in the
set. It can be shown that $0 \le I_{\mathrm{NMI}} \le 1$ with $I_{\mathrm{NMI}}(A, A) = 1$.
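Both indices are straightforward to compute from two partitions given as lists of sets. The sketch below counts pairs for the Jaccard index (pairs joined in both partitions, divided by pairs joined in at least one) and uses cluster overlaps for the normalized mutual information; the function names are hypothetical, and the partitions are those of Example 1 in Section 3.4.

```python
import math
from itertools import combinations

def jaccard_index(A, B, elements):
    """Pair-counting Jaccard index n11 / (n11 + n10 + n01).

    Undefined (division by zero) if no pair is joined in either partition.
    """
    label_a = {x: i for i, a in enumerate(A) for x in a}
    label_b = {x: j for j, b in enumerate(B) for x in b}
    n11 = n10 = n01 = 0
    for x, y in combinations(elements, 2):
        same_a = label_a[x] == label_a[y]
        same_b = label_b[x] == label_b[y]
        n11 += same_a and same_b
        n10 += same_a and not same_b
        n01 += same_b and not same_a
    return n11 / (n11 + n10 + n01)

def nmi_index(A, B, n):
    """Normalised mutual information between two partitions of n elements.

    Undefined if both partitions consist of a single cluster.
    """
    num = sum(len(a & b) * math.log(len(a & b) * n / (len(a) * len(b)))
              for a in A for b in B if a & b)
    den = (sum(len(a) * math.log(len(a) / n) for a in A)
           + sum(len(b) * math.log(len(b) / n) for b in B))
    return -2.0 * num / den

A = [{1, 2, 3, 4, 5, 6}, {7, 8, 9}]
B = [{1, 2, 4, 5, 7, 8}, {3, 6}, {9}]
jaccard_ab = jaccard_index(A, B, range(1, 10))   # 8 / (8 + 10 + 8)
nmi_ab = nmi_index(A, B, 9)
nmi_aa = nmi_index(A, A, 9)
```

For these partitions 8 pairs are joined in both, 10 only in A and 8 only in B, which gives a Jaccard index of 8/26.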

In order to compare these two indices with the two distance measures introduced 
in the previous section, we turn the measures into indices by 

$$ I_{\mathrm{moved}} = 1 - \frac{m_{\mathrm{moved}}}{n} \quad \text{and} \quad I_{\mathrm{div}} = 1 - \frac{m_{\mathrm{div}}}{n}, \qquad (8) $$

where n is the number of units in the partitioned set. In Fig. 5 we show how these
four indices vary for the partitionings obtained by hierarchical clustering (shortest 
path) for the Karate club, when we compare with the actual division which took 
place. 

The index $I_{\mathrm{div}}$ is a straight line, which indicates that one more node has been misclassified
per extra community which is considered. The other three indices behave
somewhat similarly, with the exception of the mutual information, $I_{\mathrm{NMI}}$, which does
not tend to zero for the limiting case when all nodes are placed in one cluster each.
Fig. 5. The four indices $I_{\mathrm{div}}$ (full curve), $I_{\mathrm{moved}}$ (dashed curve), $I_J$ (dotted curve) and
$I_{\mathrm{NMI}}$ (dash-dotted curve) as a function of the number of clusters uncovered by hierarchical
clustering (shortest path) for the Karate club.

As discussed in the previous section, the measures $m_{div}$ and $m_{moved}$ can be
interpreted as the number of divisions and the number of movements, respectively,
which are necessary for making two sets coincide. Further, for $m_{div}$ and $m_{moved}$ we
also obtain partial measures which directly show whether one partition is a subpartition
of the other. Because of this direct interpretation, the property of showing
subpartitions, and the similarity with the other measures when considered as indices, we
use $m_{div}$ and $m_{moved}$ in the remainder of this paper.



4 Computer generated networks 

Now we consider a class of computer generated random networks. These networks,
originally introduced in [8], consist of 128 nodes each, divided into four communities
of equal size. The links are distributed randomly, with the same probability for
a link to occur for each pair of intra-community nodes, and another constant
probability for each pair of inter-community nodes, such that the average degree of each
node becomes 16. The intra-community degree, i.e., the part of the degree that stems
from links within the same community, is denoted $z_{in}$, and the inter-community
degree $z_{out}$.
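
A network of this kind can be generated as in the following sketch. It is an illustration rather than the generator used for the results reported here: the function name, the seed handling and the choice of link probabilities $p_{in} = z_{in}/31$ and $p_{out} = z_{out}/96$ (so that the expected degrees are $z_{in}$ and $z_{out}$ with $z_{in} + z_{out} = 16$) are our assumptions.

```python
import random


def planted_partition_graph(z_out, n_communities=4, community_size=32,
                            avg_degree=16.0, seed=None):
    """Return (edges, community) for one random benchmark network."""
    rng = random.Random(seed)
    n = n_communities * community_size
    z_in = avg_degree - z_out
    # Each node has community_size - 1 possible intra-community partners
    # and n - community_size possible inter-community partners.
    p_in = z_in / (community_size - 1)
    p_out = z_out / (n - community_size)
    if not (0.0 <= p_in <= 1.0 and 0.0 <= p_out <= 1.0):
        raise ValueError("z_out is incompatible with the requested average degree")
    community = [v // community_size for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community


edges, community = planted_partition_graph(z_out=4, seed=1)
mean_degree = 2 * len(edges) / len(community)
```

The realized mean degree fluctuates around 16 from one network to the next, which is one reason for averaging over many realizations.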

By applying the algorithms introduced above (100 times) to the networks with
$0 < z_{out} < 10$, we can compare our methods to some others studied in the literature. In
Fig. 6 we show the fraction of correctly classified nodes for our three algorithms
for different values of $z_{out}$. It turns out that the K-means clusterings (both measures)
give results that are comparable to those of methods developed very recently
specifically for this community detection problem [5]. Indeed, the results of
the two K-means algorithms are identical, which makes these computer generated
networks unique among the ones we study. This might reflect that these networks
are in some sense less complex than the real-world networks, but this issue deserves
further investigation before any definitive conclusions can be drawn. The fraction
of nodes that are classified correctly by hierarchical clustering (shortest path) is
lower, but the result is still reasonable, as 60% of the nodes are correctly classified when
$z_{out} = 8$.



Fig. 6. Fraction of correctly classified nodes (upper curves for low $z_{out}$) and distances from
the true partition (lower curves for low $z_{out}$) for computer generated networks, as measured
by $m_{div}$ and $m_{moved}$ ($m_{div} < m_{moved}$). Solid curves: hierarchical clustering (shortest path).
Dashed-dotted curves: K-means clustering (shortest path and sum of paths coincide). Each
point is an average over 100 different networks. Typical values of the modularity, obtained
from the true partition, are marked. Horizontal axis: average number of inter-community
edges per vertex.



However, measuring only the fraction of correctly classified nodes might give a
misleading picture of performance, as also noted in [4]. For instance, if the community
detection algorithm happens to divide one correctly identified cluster into two, this
is normally not a serious error, but the fraction of correctly classified nodes will
still decrease drastically.

If we instead consider the measure $m_{div}$ introduced above, we can see that the
deviation from the real structure is quite modest. In Fig. 6, we also depict the two
distance measures $m_{moved}$ and $m_{div}$ (total distance), which in some sense give a
better description of the outcome of the algorithms. Here, however, we clearly see
that these different descriptions give similar results.
5 College football



We now consider United States college football [8]. The network represents the
schedule of Division 1 games for the 2000 season: the 115 vertices in the
graph represent teams (identified by their college names) and the 613 edges
represent regular-season games between the two teams they connect. The community
structure is well known: the teams are divided into 13 conferences containing
around 8-12 teams each. Games are more frequent between members of the same
conference than between members of different conferences, with teams playing an
average of about seven intra-conference games and four inter-conference games
in the 2000 season. Inter-conference play is not uniformly distributed; teams
that are geographically close to one another but belong to different conferences are
more likely to play one another than teams separated by large geographic distances.

We have applied the same clustering procedures as above to these data. The
results are reported in Fig. 7, and the most prominent features are summarized in
Tables 4 and 5.$^3$ All algorithms are able to partition the network into communities.
The highest value of the modularity $Q$ for each algorithm has been determined,
and the result is reported in Table 4 together with the corresponding community
structure and the results of the null hypothesis test.
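
For orientation, the modularity of a given partition can be computed from an edge list and a community assignment. The sketch below assumes that the modularity (4) has Newman's standard form, $Q = \sum_c [\, l_c/m - (d_c/2m)^2 \,]$, with $l_c$ the number of edges inside community $c$, $d_c$ the total degree of its nodes and $m$ the number of edges; the function name and the toy graph are illustrative only.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph.

    edges:      iterable of (u, v) pairs, each undirected edge listed once
    community:  mapping (list or dict) from node to community label
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("the graph has no edges")
    internal = defaultdict(int)    # l_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: summed degree of the nodes in c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(toy_edges, [0, 0, 0, 1, 1, 1])
q_single = modularity(toy_edges, [0, 0, 0, 0, 0, 0])
```

For the toy graph, splitting along the bridge gives $Q = 5/14$, while placing all nodes in a single community gives $Q = 0$.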

Table 4
College football network: highest modularities obtained using different approaches and
their corresponding community structure, computed for the true network ($Q$) and for the
null hypothesis ($Q_h$).

Algorithm                      Q      #clusters   Q_h             #clusters
Hierarchical (shortest path)   0.46   8           0.23 ± 0.010    7
K-means (sum of paths)         0.60   11          0.28 ± 0.006    5
K-means (shortest path)        0.60   10          0.27 ± 0.007    5

The K-means algorithm gives similar results in both cases, i.e., with the distance between
nodes given by (1) and with the distance given by the shortest path between the nodes. Two
different partitions are determined, the only difference being that in the latter partition
one conference is broken into two pieces, which are grouped with two other conferences.
This is the reason why in Fig. 7 (a) and (b) we only show the results from one of
the K-means algorithms and from the hierarchical clustering.

The resulting partition is very close to the real conference structure of the 2000
season. Only 11 communities are detected because two of the conferences are
identified as one conference (this also happens with the other algorithms).

3 The modularity for the true division into conferences is Q = 0.537, which is smaller 
than the peak we obtain, Q = 0.6044. 
Fig. 7. College football network. (a) Modularities $Q$ (solid curve) and coherence scores
(Z-values) (dashed curve) depicted as functions of the number of clusters when the partition
comes from hierarchical clustering (shortest path). (b) Same as (a) but for K-means (sum of
paths). (c) Distances between the detected partition $P$ and the meet, and between the true
partition $P_0$ and the meet, as functions of the number of clusters, when the partition comes
from hierarchical clustering (shortest path). Solid curve: $m_{moved}(P, meet)$, dashed curve:
$m_{moved}(P_0, meet)$, dotted curve: $m_{div}(P, meet)$, dashed-dotted curve: $m_{div}(P_0, meet)$. (d)
Same as (c) but for K-means (sum of paths).

There are 5 independent teams that do not belong to any conference and they tend 
to be grouped with the teams with which they are most closely associated. There 
are only three teams that are misclassified. 

For the partitions of Table 4, the distance from the original division of the set has
been computed using both measures, $m_{moved}$ and $m_{div}$. The results are listed in
Table 5 and illustrated in Fig. 7 (c) and (d).

Here it is clear that the partition closest to the original division is the one detected by
the K-means algorithm with the sum of paths as the distance between nodes; in fact, only
14 elements must be moved in order to make the partition coincide with the meet.

Table 5
College football network: distance between the partitions ($P$) with highest $Q$ and the meet,
and between the original division ($P_0$) and the meet.

                                                 m_moved             m_div
Algorithm                      No. of clusters   dist P   dist P_0   dist P   dist P_0
Hierarchical (shortest path)   8                 50       24         25       20
K-means (sum of paths)         11                14       5          5        3
K-means (shortest path)        10                21       9          9        6

Both partitions obtained with the K-means algorithm (using both definitions of the
distance between nodes) correspond to high values of the modularity $Q$, so it is
interesting to see how distant these two partitions are from each other. Let
$P_{Kw}$ denote the partition obtained with the K-means algorithm (sum of paths)
($N = 11$), and $P_{Ka}$ the partition obtained with the K-means algorithm (shortest
path) ($N = 10$). The distances to the meet from the respective partitions become
$m_{moved}(P_{Kw}, meet) = 6$, $m_{moved}(P_{Ka}, meet) = 12$, $m_{div}(P_{Kw}, meet) = 2$, and
$m_{div}(P_{Ka}, meet) = 3$. The total number of elements to be moved in the two
partitions in order for them to become identical is 18. If 6 elements are moved within
the communities of $P_{Kw}$, then $P_{Ka}$ becomes a subpartition (as the number of
partitions is different). These two partitionings are thus closer to each other than either
of them is to the "true" partitioning; hence, at least in this case, the choice of algorithm
seems less crucial.



Coherence score 

If the true partition of the network is not known, we cannot measure the distance and
must validate the structure in another way. A common case is that we
have annotations (one or many) for each node, and that we can utilize
the property that some of these annotations are common to more than one node.
Here we propose the following method, illustrated by the simplest case where each
node has exactly one annotation:

Assume every node in the network has a classification; here it is the conference
to which it belongs. To judge the quality of a specific community partitioning, let
us take one of the modules we detected as an example. This module contains ten
teams in total, with eight teams from one conference and two teams from another
conference. Given the actual sizes of these two conferences, the total number of
teams and the total size of this module, we can calculate the probability that
eight teams from the first conference would have ended up in the module
by chance (from the hypergeometric distribution), and the corresponding
probability for the two teams from the other conference.$^4$ These probabilities are
the p-values for this instance, and we assign the lowest one we find to this module.
For future processing, we store the negative logarithm of this p-value.

4 For reasons that will become clear in the next section, we do not calculate the total 
probability for this event. 
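
The per-module calculation described above can be sketched as follows. Several details are assumptions made for illustration: we take the p-value to be the upper tail of the hypergeometric distribution (at least $k$ teams from the class), and the conference sizes 9 and 12 are hypothetical placeholders; only the total of 115 teams and the 8 + 2 composition of the module come from the text.

```python
from math import comb, log


def hypergeom_tail(k, module_size, class_size, total):
    """P(X >= k) when module_size nodes are drawn without replacement
    from `total` nodes, of which `class_size` carry the annotation."""
    upper = min(module_size, class_size)
    favourable = sum(
        comb(class_size, x) * comb(total - class_size, module_size - x)
        for x in range(k, upper + 1)
    )
    return favourable / comb(total, module_size)


def module_score(counts, class_sizes, total):
    """Negative logarithm of the lowest p-value among the classes
    represented in one module."""
    module_size = sum(counts.values())
    p_min = min(
        hypergeom_tail(k, module_size, class_sizes[c], total)
        for c, k in counts.items()
    )
    return -log(p_min)


# Hypothetical conference sizes; the module holds 8 + 2 teams out of 115.
class_sizes = {"conference_A": 9, "conference_B": 12}
counts = {"conference_A": 8, "conference_B": 2}
score = module_score(counts, class_sizes, total=115)
```

Exact integer binomials are used throughout, so no external statistics library is needed for networks of this size.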
However, it is problematic to consider these p-values as true probabilities, since
we perform many different tests (one for each class of nodes that exists
within the actual module), i.e., this is a multiple testing problem: when many tests
are performed, it is likely that some unlikely outcome occurs. Therefore we pick eleven
random teams and calculate the p-values in the same way again, keeping the lowest one.
By repeating this many times, we obtain a distribution of (negative logarithms of)
p-values which has a mean, $\bar{p}$, and a standard deviation, $\sigma_p$. From this, we can
construct standard Z-scores as

$$
Z = \frac{p - \bar{p}}{\sigma_p}, \qquad (9)
$$

where $p$ is the value stored for the actual module.

These Z-scores indicate how unlikely the present distribution is: the higher
the Z-score, the more improbable the actual distribution, and the more relevant
the module we found. That is, they form a coherence score for the community.
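
The construction of the coherence score for a single module can be sketched as below. This is a minimal illustration under stated assumptions, not the code used for the figures: each node carries exactly one annotation, the p-values are upper hypergeometric tails, the random modules are drawn with the same size as the module under test, and the toy data (12 classes of 10 nodes) are invented for the example.

```python
import random
from collections import Counter
from math import comb, log
from statistics import mean, stdev


def neg_log_min_p(members, annotation, class_sizes):
    """-log of the lowest hypergeometric tail p-value over the classes
    represented among `members`."""
    total, size = len(annotation), len(members)
    denom = comb(total, size)
    best = 1.0
    for c, k in Counter(annotation[v] for v in members).items():
        big_k = class_sizes[c]
        tail = sum(comb(big_k, x) * comb(total - big_k, size - x)
                   for x in range(k, min(size, big_k) + 1)) / denom
        best = min(best, tail)
    return -log(best)


def coherence_z(members, annotation, n_random=1000, seed=None):
    """Z-score of a module against randomly drawn modules of equal size."""
    rng = random.Random(seed)
    class_sizes = Counter(annotation.values())
    nodes = list(annotation)
    observed = neg_log_min_p(members, annotation, class_sizes)
    null = [neg_log_min_p(rng.sample(nodes, len(members)), annotation, class_sizes)
            for _ in range(n_random)]
    # stdev(null) is zero only in degenerate cases, which are not handled here.
    return (observed - mean(null)) / stdev(null)


# Toy data (hypothetical): 120 nodes in 12 classes of 10 nodes each.
annotation = {v: v // 10 for v in range(120)}
module = list(range(8)) + [10, 11]   # 8 nodes of class 0, 2 nodes of class 1
z = coherence_z(module, annotation, n_random=200, seed=0)
```

Because the null distribution is obtained by sampling, the Z-score varies slightly between runs unless the seed is fixed.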

By considering all modules simultaneously, and proceeding in the way described
above, we obtain a global coherence score for the whole community structure. In
Fig. 7, we show the modularity, the distances from the true partition (both $m_{moved}$
and $m_{div}$) and the coherence scores for different numbers of communities. The
algorithms employed are hierarchical clustering (shortest path) in (a) and (c), and
K-means (sum of paths) in (b) and (d). We clearly see how the coherence score and
the modularity covary, such that they essentially peak at the same number of
clusters. Remembering that the modularity is a graph-theoretical measure relying
only on network properties (paying no attention to what the nodes represent) and
that the coherence score does not utilize that the underlying structure is a network
but only focuses on the annotations of the nodes, this is a remarkable result. Further,
we see in (c) and (d) how the total distances, both $m_{moved}$ and $m_{div}$, have minima
approximately where the modularity peaks. These observations support the tacit
assumption in many previous papers that the partition with the highest modularity
also corresponds to the one that makes most sense.



6 Gene network of Saccharomyces cerevisiae 

As our final example, we consider a somewhat larger network where no true
partition exists (or at least none is known). The network represents the interactions between
regulatory proteins and genes in Saccharomyces cerevisiae (ordinary yeast) [17].
The 690 nodes represent genes,$^5$ and the 1079 edges correspond to biochemical
interactions. The edges are directed from a gene that encodes a transcription
factor to a gene transcriptionally regulated by that protein, i.e., we have a directed
network. However, here we remove all directionality from the edges in order to
more easily apply the same algorithms and measures as earlier in this paper.

5 A few are "pseudogenes", which here means that they are complexes composed of two
or more gene products.
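
Removing the directionality amounts to symmetrizing the edge list, as in the short sketch below. The treatment of reciprocal pairs (merged into one edge) and of self-loops (dropped) is our assumption, since the text does not specify it, and the gene names are placeholders.

```python
def undirected_edges(directed_edges):
    """Drop edge direction: merge reciprocal pairs and skip self-loops."""
    seen = set()
    for source, target in directed_edges:
        if source == target:
            continue  # assumption: self-regulation is ignored
        # Store each edge with its endpoints in sorted order.
        seen.add((source, target) if source <= target else (target, source))
    return sorted(seen)


# Placeholder gene names, not taken from the yeast data set.
regulation = [("geneA", "geneB"), ("geneB", "geneA"),
              ("geneA", "geneA"), ("geneC", "geneB")]
links = undirected_edges(regulation)
```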
We have partitioned this gene network into several communities using the
algorithms described above. For each number of communities, we have as before
calculated the modularity (4) and the coherence (9); see Fig. 8.




Fig. 8. Modularity (solid curve) and coherence (dashed curve) for the communities detected
in the gene network of Saccharomyces cerevisiae, as functions of the number of clusters.
The coherence is scaled down by a factor of 100. Left: Hierarchical clustering (shortest
path). Right: K-means clustering (sum of paths).

The external validation for calculating the coherence is this time somewhat less
straightforward. We employ the Gene Ontology (GO) database [1] in order to find
a classification for the nodes/genes of the network. In this database, the genes of
S. cerevisiae (and several other organisms) are arranged in a directed acyclic graph
according to which biological process they belong to.$^6$ A gene is assigned the
ontology term for the process it belongs to, as well as all terms for the parental
processes in the graph, i.e., there are many different terms associated with each gene.
To judge the quality of a specific community partitioning, we query the database
with a list of all genes in each community. It is because of these multiple
annotations of all genes that we cannot calculate one probability for the whole network
division, as remarked in the previous section. The resulting p-values are then treated in
the same way as for the football teams, and we obtain coherence scores from the same kind
of null hypothesis as before. In Fig. 9 we show the distribution of all coherence
scores for the divisions into 22 and 177 communities, respectively, obtained by
hierarchical clustering (shortest path), with corresponding Q-values of $Q = 0.67$ and
$Q = 0.51$ (see Fig. 8).
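
The inheritance of parental terms in a directed acyclic graph can be illustrated with the following sketch. It does not query the Gene Ontology database; the term and gene names are abstract placeholders, and the dictionary-based representation of the graph is our assumption.

```python
def all_terms(direct_terms, parents):
    """Return, for each gene, its direct terms plus every ancestor term.

    direct_terms: dict gene -> iterable of directly assigned terms
    parents:      dict term -> iterable of parent terms (a DAG)
    """
    cache = {}

    def closure(term):
        # The term itself together with all of its ancestors, memoized.
        if term not in cache:
            result = {term}
            for parent in parents.get(term, ()):
                result |= closure(parent)
            cache[term] = result
        return cache[term]

    return {gene: set().union(*(closure(t) for t in terms))
            for gene, terms in direct_terms.items()}


# Abstract toy hierarchy: "leaf" has two parents that share one root.
parents = {"leaf": ["mid1", "mid2"], "mid1": ["root"], "mid2": ["root"]}
direct_terms = {"gene1": ["leaf"], "gene2": ["mid1"], "gene3": []}
annotations = all_terms(direct_terms, parents)
```

Since a gene annotated with a specific term also carries all more general terms, each community gives rise to one test per term, which is the multiple-annotation situation discussed above.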

6 They can also be ordered according to biological function or cellular localization,
properties we do not utilize here.

Fig. 9. Coherence scores (Z-values) for the biological process terms associated with each
community. The light grey bars correspond to a division into 22 communities obtained with
hierarchical clustering (shortest path), which has $Q = 0.67$. The dark grey bars correspond to
177 communities, obtained with the same algorithm and with $Q = 0.51$. The widths of the
bars are proportional to the number of genes in each community. Bar labels recoverable from
the figure: nitrogen utilization, reproduction, DNA packaging, biopolymer metabolism,
phospholipid biosynthesis. Horizontal axis: number of standard deviations.

This time, we cannot measure the distance from any kind of "true" partition.
Instead we do a pairwise comparison between different methods. In Fig. 10 we show
the normalized$^7$ total distances $m_{moved}$ and $m_{div}$ between some different
partitionings. To the left, the figure shows the normalized distances between the very "best"
(highest modularity) division obtained with K-means (shortest path) and the
divisions obtained by the same algorithm for the number of communities depicted along
the x-axis. To the right, the figure shows the normalized distances between the
partitioning obtained by hierarchical clustering (shortest path) and the one obtained by
K-means (shortest path), for the number of communities depicted along the x-axis.
Remarkably, we see how these distances have their minima not far from where the
modularity and the coherence score have their peaks. By definition, the minima in
the left panel have to be exactly zero, but it is worth noting how sharp these extrema
are. The minima in the right panel are broader, but still in a clear neighbourhood
of the maxima of the modularity and the coherence score. The conclusion is that
the outcomes of the different clustering algorithms seem to coincide, as long as we
are satisfied with looking at the "best" partition (in the sense of having the highest
modularity). However, if we for some reason do not strive for this optimum, the
procedures can give rise to very different results.

7 Without prior knowledge of the network, it is hard to tell whether a distance is small or
large just from the numbers $m_{moved}$ and $m_{div}$. Therefore, we normalize these numbers by
dividing them by the distances obtained as the mean of a sample of random partitionings
with the same number of clusters and the same number of units in each cluster as for the
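
The normalization of a distance by its mean over random partitionings with the same cluster sizes can be sketched as follows. Shuffling the labels keeps the number of clusters and the number of units per cluster fixed. Note the assumptions: the stand-in distance `pair_disagreements` is a simple pair-counting measure used only to make the example runnable, not $m_{moved}$ or $m_{div}$, and all function names are ours.

```python
import random
from itertools import combinations
from statistics import mean


def shuffled_partition(labels, rng):
    """Random partition with the same cluster sizes as `labels`."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def pair_disagreements(labels_a, labels_b):
    """Stand-in distance: pairs joined in one partition but not in the other."""
    return sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in combinations(range(len(labels_a)), 2)
    )


def normalized_distance(labels_a, labels_b, distance=pair_disagreements,
                        n_random=200, seed=None):
    """Distance divided by its mean over size-preserving random partitions."""
    rng = random.Random(seed)
    baseline = mean(
        distance(shuffled_partition(labels_a, rng),
                 shuffled_partition(labels_b, rng))
        for _ in range(n_random)
    )
    # A zero baseline (e.g. trivial partitions) is not handled in this sketch.
    return distance(labels_a, labels_b) / baseline


a = [0] * 5 + [1] * 5 + [2] * 5
b = a[1:] + a[:1]            # same cluster sizes, one element shifted
d_same = normalized_distance(a, a, seed=1)
d_shifted = normalized_distance(a, b, seed=1)
```

Any other partition distance with the same call signature can be passed through the `distance` argument.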
Fig. 10. Normalized total distance between the very best division obtained by K-means
(shortest path) and the divisions obtained by K-means (shortest path) with various numbers
of communities (depicted along the x-axis) (left), and between partitionings obtained by
hierarchical clustering (shortest path) and K-means (shortest path) for the same number of
communities (right). Solid (upper) curves are $m_{moved}$, while the dashed (lower) curves
are $m_{div}$. Horizontal axis in both panels: number of clusters.

7 Discussion and conclusions 

In the present article, we have presented various aspects of how to find and evaluate
community structures in complex networks. The issue of finding such a structure
is closely related to the clustering of data based on similarity over many experiments,
without any underlying network structure assumed. This latter exploration is a
standard tool today, and there are many well-developed algorithms for handling the
issue. Here we show explicitly that at least some of these algorithms can also be used
for finding communities; indeed, one of them provided the best partitioning
in terms of high modularity that we have found in the literature thus far. A drawback,
however, is that these algorithms are computationally less efficient.

Since there is no generally accepted definition of what really constitutes a
reasonable partitioning, we explored both Newman's modularity and the so-called
Silhouette index. The former has become a de facto standard, and many
authors simply equate a proposed modular structure with having a high modularity.
In contrast, we quickly found the Silhouette index to be less suitable for networks.

The question of how distant two partitionings of the same network are is somewhat
new in the present physics literature,$^8$ although the issue is by no means original.
We described two different such measures and discussed the advantages and drawbacks
of each. A combined use of both of course yields a better description. We also
turned these two measures into indices and compared them with the Jaccard index
and the mutual information index, showing that the behaviour was similar among
those.

8 To the best of our knowledge, the only exception is [4], to which one of the referees drew
our attention.



Finally, we proposed the introduction of a coherence score, indicating the validity
of the partitioning. This score is especially well suited when each node in the network
has several annotations, or belongs to annotations presented as a directed acyclic
graph, where children inherit all the parents' annotations. This is the case for the
Gene Ontology database, which has become very popular recently in the
bioinformatics community. It turns out that the modularity and the coherence score peak for
approximately the same number of communities, at least for the networks we have
considered here. Hence, this observation gives support to the standard procedure of
only striving to optimize the modularity.

In summary, the issue of finding and evaluating community structures in complex
networks is an important part of unraveling the properties of such systems. Here
we have discussed various aspects of these themes, and also introduced the new
concept of "coherence" for a network. We strongly believe this can be a valuable
tool in the future for the exploration of various networks.



Acknowledgment 

The authors thank M. Girvan and M.E.J. Newman for providing the data of the
college football network. Financial support from CENIIT (Centre for Industrial IT
at Linköping University) and from the Carl Trygger's foundation is acknowledged.

References 



[1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. 
Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, 
A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and 
G. Sherlock. Gene ontology: tool for the unification of biology. Nat. Gen., 25:25-29, 
2000. www.geneontology.org/, visited December 21, 2003.

[2] F. Azuaje. A cluster validity framework for genome expression data. Bioinformatics, 
18:319-320, 2002. 

[3] N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression 
data. Signal processing, 83:825-833, 2003. 

[4] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas. Comparing community 
structure identification. Journal of Statistical Mechanics: Theory and Experiment, 
2005(09):P09008, 2005. 

[5] J. Duch and A. Arenas. Community identification using extremal optimization. 
Physical Review E, 72, 2005.






[6] T. Evans. Complex networks. Contemporary Physics, 45(6):455-474, 2004. 

[7] D. Fisher. Iterative optimization and simplification of hierarchical clusterings. Journal 
of Artificial Intelligence Research, 4:147-179, 1996. 

[8] M. Girvan and M. E. J. Newman. Community structure in social and biological networks.
PNAS, 99(12):7821-7826, June 2002. 

[9] R. Guimera, M. Sales-Pardo, and L. A. N. Amaral. Modularity from fluctuations in 
random graphs and complex networks. Physical Review E (Statistical, Nonlinear, and 
Soft Matter Physics), 70(2):025101, 2004. 

[10] D. Gusfield. Partition-distance: A problem and class of perfect graphs arising in 
clustering. Information Processing Letters, 82(3):159-164, 2002.

[11] Z. Huang. A fast algorithm to cluster very large categorical data sets in data mining. 
In Proceedings of SIGMOD Workshop on Research Issues on Data Mining and 
Knowledge Discovery, Tucson, Arizona, 1997. 

[12] P. Jaccard. The distribution of flora in the alpine zone. The New Phytologist,
11(2):37-50, 1912.

[13] L. I. Kuncheva and S. T. Hadjitodorov. Using diversity in cluster ensembles. In 2004 
IEEE International Conference on Systems, Man and Cybernetics, pages 1214-1219, 
2004. 

[14] S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks.
Science, 296:910-913, 2002. 

[15] C. P. Massen and J. P. K. Doye. Identifying communities within energy landscapes. 
Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 71(4):046101, 
2005. 

[16] G. W. Milligan. A Monte Carlo study of thirty internal criterion measures for cluster
analysis. Psychometrika, 46(2):187-199, 1981.

[17] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network
motifs: Simple building blocks of complex networks. Science, 298:824-827, 2002.

[18] S. Muff, F. Rao, and A. Caflisch. Local modularity measure for network 
clusterizations. Physical Review E, 056107, 2005. 

[19] M. Newman. The structure and function of complex networks. SIAM Review, 45:167- 
256, 2003. 

[20] M. Newman. Detecting community structure in networks. Eur. Phys. J. B, 38:321-330, 
2004. 

[21] M. Newman and M. Girvan. Finding and evaluating community structure in networks. 
Phys. Rev. E, 69:026113, 2004. 

[22] M. E. J. Newman. Fast algorithm for detecting community structure in networks. 
Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 69(6):066133, 
2004. 






[23] E. Ravasz, A. Somera, D. Mongru, Z. Oltvai, and A.-L. Barabási. Hierarchical organization
of modularity in metabolic networks. Science, 297:1551-1555, 2002. 

[24] A. W. Rives and T. Galitski. Modular organization of cellular networks. Proc. Nat. 
Acad. Sciences USA, 100(3):1128-1133, 2003.

[25] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster 
analysis. J. Comput. Appl. Math., 20:53-65, 1987.

[26] T. Speed, editor. Statistical Analysis of Gene Expression Microarray Data. 
Interdisciplinary Statistics Series. Chapman & Hall/CRC, Boca Raton, 2003. 

[27] R. P. Stanley. Enumerative Combinatorics, volume 1. Cambridge University Press, 
1997. 

[28] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for 
combining multiple partitions. Journal of Machine Learning Research, 3:583-617, 
2002. 

[29] S. van Dongen. Graph clustering by flow simulation. PhD thesis, Universiteit Utrecht, 
2000. 

[30] W. Zachary. An information flow model for conflict and fission in small groups. 
Journal of Anthropological Research, 33:452-473, 1977.

[31] H. Zhou. Network landscape from a Brownian particle's perspective. Physical Review
E, 67:041908, 2003. 






